last modified Jun 08, 2020 02:14

Management and Analysis Phase

This chapter summarizes the services and activities that support the national climate community in accessing and analysing CMIP-based model data, especially in the context of CMIP6. A large, centrally managed data pool has been established as the basis for data distribution via the Earth System Grid Federation (ESGF) and for efficient data analysis on DKRZ HPC resources. A data management workflow ensures the timely ingestion of climate model data originating from national modeling groups (primary data ingest and publication) as well as from groups around the world (data replication).

The core components of the data management workflow centered around the CMIP data pool are indicated in the picture below with the numbers 1 to 4: 

 

 

[Figure: Data management workflow]

 

1. Data quality assurance

All data from national modeling groups is checked with the quality assurance tool developed at DKRZ before ingestion into the CMIP data pool starts. To support modeling groups in early spot-checking of individual data files, a web-based service has been established.

.. details like references and documentation to be added ..

 

2. ESGF data publication

The data is published to the DKRZ ESGF portal via the ESGF data node installations at DKRZ. The portal lets users register, search for data, and access data, and thus makes the data visible and accessible to the national as well as the worldwide research community.
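The search functionality can also be reached programmatically. The following is a minimal sketch that only builds a query URL for the ESGF search REST interface; the host name and path in `SEARCH_URL`, the helper name `build_search_url`, and the chosen facet values are assumptions for illustration and should be checked against the portal actually in use.

```python
from urllib.parse import urlencode

# Assumed search endpoint of the DKRZ ESGF node; adjust if the portal differs.
SEARCH_URL = "https://esgf-data.dkrz.de/esg-search/search"


def build_search_url(base_url=SEARCH_URL, limit=10, **facets):
    """Return a dataset search URL for the given facet constraints.

    `facets` are passed through unchanged as query parameters, for example
    project="CMIP6" or variable_id="tas" (illustrative values).
    """
    params = {
        "type": "Dataset",                   # search datasets rather than files
        "format": "application/solr+json",   # ask for a JSON response
        "latest": "true",                    # only the latest dataset versions
        "limit": limit,                      # maximum number of results
    }
    params.update(facets)
    # Sort by parameter name so the resulting URL is reproducible.
    return base_url + "?" + urlencode(sorted(params.items()))


if __name__ == "__main__":
    print(build_search_url(project="CMIP6",
                           experiment_id="historical",
                           variable_id="tas"))
```

The printed URL can be opened in a browser or fetched with any HTTP client; the sketch deliberately stops short of sending the request so that it runs without network access.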

ESGF data publication also includes steps to make the data referenceable and citable: all data sets published in ESGF are assigned persistent identifiers (PIDs) and are linked to citation information. In this context, DKRZ also acts as a central provider of PID and data citation services in the worldwide ESGF data federation.

All data published in ESGF is ingested and managed as part of the overall DKRZ CMIP data pool (see section 4).

3. ESGF data replication

Climate data analysis activities often involve very large data collections, including data from modeling centers around the world. Downloading and maintaining these high-volume datasets at the home institutes of climate researchers is time-consuming and inefficient. DKRZ therefore supports climate researchers with central data replication activities and services. Important and frequently needed data collections are replicated automatically from remote sites and made locally accessible as part of the ESGF data pool. The replication support team at DKRZ can be reached by email at data_replication /at/ dkrz.de.

4. CMIP data pool: access and management

The CMIP data pool provides the storage resources needed to make high-volume data collections efficiently accessible to the national climate community. In total, 5 petabytes are reserved and made available as part of the DKRZ HPC Lustre storage resources. These 5 petabytes need to be managed consistently, based on the different user needs as well as on international agreements on the provisioning of replicas.

Priorities with respect to data storage are decided by a review board.
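For access to the pool from analysis scripts, it helps to know where a data set is located. The sketch below assumes that the pool mirrors the directory template of the CMIP6 Data Reference Syntax (DRS); this page does not state the pool layout, so the root path `/path/to/cmip-pool`, the helper name `drs_directory`, and the example facet values are hypothetical.

```python
from pathlib import PurePosixPath

# Directory levels of the CMIP6 DRS template, in order.
DRS_FACETS = (
    "mip_era", "activity_id", "institution_id", "source_id",
    "experiment_id", "member_id", "table_id", "variable_id",
    "grid_label", "version",
)


def drs_directory(pool_root, **facets):
    """Build the DRS directory of one data set below `pool_root`.

    Raises ValueError if any of the DRS levels is missing.
    """
    missing = [name for name in DRS_FACETS if name not in facets]
    if missing:
        raise ValueError("missing DRS facets: " + ", ".join(missing))
    return PurePosixPath(pool_root).joinpath(*(facets[n] for n in DRS_FACETS))


if __name__ == "__main__":
    # Hypothetical pool root and illustrative facet values.
    print(drs_directory(
        "/path/to/cmip-pool",
        mip_era="CMIP6", activity_id="CMIP", institution_id="MPI-M",
        source_id="MPI-ESM1-2-HR", experiment_id="historical",
        member_id="r1i1p1f1", table_id="Amon", variable_id="tas",
        grid_label="gn", version="v20190710",
    ))
```

Using a pure path object keeps the sketch independent of the file system, so it can be tried out on any machine before pointing it at the real pool location.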